Clustering with scikit-learn

In this notebook, we will learn how to perform k-means clustering using scikit-learn in Python.

We will use cluster analysis to generate a big-picture model of the weather at a local station using minute-granularity data. The dataset contains over 1.5 million records. How do we create 12 clusters out of them?

NOTE: The dataset we will use is in a large CSV file called minute_weather.csv. Please download it into the weather directory in your Week-7-MachineLearning folder. The download link is: https://drive.google.com/open?id=0B8iiZ7pSaSFZb3ItQ1l4LWRMTjg


Importing the Necessary Libraries


In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
#import utils
import pandas as pd
import numpy as np
from itertools import cycle, islice
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

%matplotlib inline


Creating a Pandas DataFrame from a CSV file


In [2]:
data = pd.read_csv('./weather/minute_weather.csv')

Minute Weather Data Description


The minute weather dataset comes from the same source as the daily weather dataset that we used in the decision-tree-based classifier notebook. The main difference between the two is that the minute weather dataset contains raw sensor measurements captured at one-minute intervals, whereas the daily weather dataset contained processed and well-curated data. The data is in the file minute_weather.csv, which is a comma-separated file.

As with the daily weather data, this data comes from a weather station located in San Diego, California. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. Data was collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions is captured.

Each row in minute_weather.csv contains weather data captured for a one-minute interval. Each row, or sample, consists of the following variables:

  • rowID: unique number for each row (Unit: NA)
  • hpwren_timestamp: timestamp of measure (Unit: year-month-day hour:minute:second)
  • air_pressure: air pressure measured at the timestamp (Unit: hectopascals)
  • air_temp: air temperature measured at the timestamp (Unit: degrees Fahrenheit)
  • avg_wind_direction: wind direction averaged over the minute before the timestamp (Unit: degrees, with 0 meaning the wind is coming from the North, and increasing clockwise)
  • avg_wind_speed: wind speed averaged over the minute before the timestamp (Unit: meters per second)
  • max_wind_direction: highest wind direction in the minute before the timestamp (Unit: degrees, with 0 being North and increasing clockwise)
  • max_wind_speed: highest wind speed in the minute before the timestamp (Unit: meters per second)
  • min_wind_direction: smallest wind direction in the minute before the timestamp (Unit: degrees, with 0 being North and increasing clockwise)
  • min_wind_speed: smallest wind speed in the minute before the timestamp (Unit: meters per second)
  • rain_accumulation: amount of accumulated rain measured at the timestamp (Unit: millimeters)
  • rain_duration: length of time rain has fallen as measured at the timestamp (Unit: seconds)
  • relative_humidity: relative humidity measured at the timestamp (Unit: percent)
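Note that hpwren_timestamp arrives as a plain string; parsing it into proper datetimes enables time-based slicing and resampling later. A minimal sketch on a tiny synthetic frame (hypothetical values mirroring a few of the columns above, not taken from the real CSV):

```python
import pandas as pd

# A tiny synthetic frame mirroring a few of the columns above
# (the real minute_weather.csv is loaded later in this notebook).
toy = pd.DataFrame({
    'rowID': [0, 1, 2],
    'hpwren_timestamp': ['2011-09-10 00:00:49',
                         '2011-09-10 00:01:49',
                         '2011-09-10 00:02:49'],
    'air_temp': [64.76, 63.86, 64.22],
})

# Parse the string column into datetime64 values.
toy['hpwren_timestamp'] = pd.to_datetime(toy['hpwren_timestamp'])
print(toy.dtypes)
```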

In [3]:
data.shape


Out[3]:
(1587257, 13)

In [4]:
data.head()


Out[4]:
rowID hpwren_timestamp air_pressure air_temp avg_wind_direction avg_wind_speed max_wind_direction max_wind_speed min_wind_direction min_wind_speed rain_accumulation rain_duration relative_humidity
0 0 2011-09-10 00:00:49 912.3 64.76 97.0 1.2 106.0 1.6 85.0 1.0 NaN NaN 60.5
1 1 2011-09-10 00:01:49 912.3 63.86 161.0 0.8 215.0 1.5 43.0 0.2 0.0 0.0 39.9
2 2 2011-09-10 00:02:49 912.3 64.22 77.0 0.7 143.0 1.2 324.0 0.3 0.0 0.0 43.0
3 3 2011-09-10 00:03:49 912.3 64.40 89.0 1.2 112.0 1.6 12.0 0.7 0.0 0.0 49.5
4 4 2011-09-10 00:04:49 912.3 64.40 185.0 0.4 260.0 1.0 100.0 0.1 0.0 0.0 58.8


Data Sampling

The dataset has lots of rows, so let us downsample by taking every 10th row.


In [5]:
sampled_df = data[(data['rowID'] % 10) == 0]
sampled_df.shape


Out[5]:
(158726, 13)
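The modulo filter above keeps rows whose rowID is a multiple of 10. On a toy frame where rowID equals the positional index, positional slicing with iloc gives the same result; on the real data this equivalence holds only while the rowIDs are consecutive and aligned with row positions, which is an assumption of this sketch:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weather frame: 100 rows with a rowID column.
df = pd.DataFrame({'rowID': np.arange(100),
                   'air_temp': np.random.rand(100)})

# Keep rows whose rowID is a multiple of 10, as done above.
every_tenth = df[(df['rowID'] % 10) == 0]

# Positional slicing: every 10th row by position.
also_every_tenth = df.iloc[::10]

print(every_tenth.shape, also_every_tenth.shape)
```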


Statistics


In [6]:
sampled_df.describe().transpose()


Out[6]:
count mean std min 25% 50% 75% max
rowID 158726.0 793625.000000 458203.937509 0.00 396812.5 793625.00 1190437.50 1587250.00
air_pressure 158726.0 916.830161 3.051717 905.00 914.8 916.70 918.70 929.50
air_temp 158726.0 61.851589 11.833569 31.64 52.7 62.24 70.88 99.50
avg_wind_direction 158680.0 162.156100 95.278201 0.00 62.0 182.00 217.00 359.00
avg_wind_speed 158680.0 2.775215 2.057624 0.00 1.3 2.20 3.80 31.90
max_wind_direction 158680.0 163.462144 92.452139 0.00 68.0 187.00 223.00 359.00
max_wind_speed 158680.0 3.400558 2.418802 0.10 1.6 2.70 4.60 36.00
min_wind_direction 158680.0 166.774017 97.441109 0.00 76.0 180.00 212.00 359.00
min_wind_speed 158680.0 2.134664 1.742113 0.00 0.8 1.60 3.00 31.60
rain_accumulation 158725.0 0.000318 0.011236 0.00 0.0 0.00 0.00 3.12
rain_duration 158725.0 0.409627 8.665523 0.00 0.0 0.00 0.00 2960.00
relative_humidity 158726.0 47.609470 26.214409 0.90 24.7 44.70 68.00 93.00

In [7]:
sampled_df[sampled_df['rain_accumulation'] == 0].shape


Out[7]:
(157812, 13)

In [8]:
sampled_df[sampled_df['rain_duration'] == 0].shape


Out[8]:
(157237, 13)


Drop the rain_accumulation and rain_duration Columns, Then Remove Rows with Missing Values


In [9]:
del sampled_df['rain_accumulation']
del sampled_df['rain_duration']

In [10]:
rows_before = sampled_df.shape[0]
sampled_df = sampled_df.dropna()
rows_after = sampled_df.shape[0]


How many rows did we drop?


In [11]:
rows_before - rows_after


Out[11]:
46
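The same before/after bookkeeping can be reproduced on a small synthetic frame (hypothetical values, not taken from the weather data), confirming that dropna removes every row containing at least one NaN:

```python
import numpy as np
import pandas as pd

# Two rows each contain one missing value.
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0],
                    'b': [5.0, 6.0, np.nan, 8.0]})

rows_before = toy.shape[0]
toy = toy.dropna()          # drop any row with at least one NaN
rows_after = toy.shape[0]

print(rows_before - rows_after)   # number of rows removed
```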

In [12]:
sampled_df.columns


Out[12]:
Index(['rowID', 'hpwren_timestamp', 'air_pressure', 'air_temp',
       'avg_wind_direction', 'avg_wind_speed', 'max_wind_direction',
       'max_wind_speed', 'min_wind_direction', 'min_wind_speed',
       'relative_humidity'],
      dtype='object')


Select Features of Interest for Clustering


In [13]:
features = ['air_pressure', 'air_temp', 'avg_wind_direction', 'avg_wind_speed', 'max_wind_direction', 
        'max_wind_speed','relative_humidity']

In [14]:
select_df = sampled_df[features]

In [15]:
select_df.columns


Out[15]:
Index(['air_pressure', 'air_temp', 'avg_wind_direction', 'avg_wind_speed',
       'max_wind_direction', 'max_wind_speed', 'relative_humidity'],
      dtype='object')

In [16]:
select_df


Out[16]:
air_pressure air_temp avg_wind_direction avg_wind_speed max_wind_direction max_wind_speed relative_humidity
0 912.3 64.76 97.0 1.2 106.0 1.6 60.5
10 912.3 62.24 144.0 1.2 167.0 1.8 38.5
20 912.2 63.32 100.0 2.0 122.0 2.5 58.3
30 912.2 62.60 91.0 2.0 103.0 2.4 57.9
40 912.2 64.04 81.0 2.6 88.0 2.9 57.4
50 912.1 63.68 102.0 1.2 119.0 1.5 51.4
60 912.0 64.04 83.0 0.7 101.0 0.9 51.4
70 911.9 64.22 82.0 2.0 97.0 2.4 62.2
80 911.9 61.70 67.0 3.3 70.0 3.5 71.5
90 911.9 61.34 67.0 3.6 75.0 4.2 72.5
100 911.8 62.96 95.0 2.3 106.0 2.5 63.9
110 911.8 64.22 83.0 2.1 88.0 2.5 59.1
120 911.8 63.86 68.0 2.1 76.0 2.4 63.5
130 911.6 64.40 156.0 0.5 203.0 0.7 50.4
140 911.5 65.30 85.0 2.2 92.0 2.5 58.0
150 911.4 64.58 154.0 1.3 176.0 2.1 50.2
160 911.4 65.48 154.0 0.9 208.0 1.9 46.2
170 911.5 65.66 95.0 1.1 109.0 1.6 45.2
180 911.4 65.66 155.0 1.1 167.0 1.6 42.8
190 911.4 67.10 157.0 1.2 172.0 1.6 36.8
200 911.4 68.00 53.0 0.3 69.0 0.5 33.4
210 911.3 67.64 167.0 1.5 196.0 2.2 34.4
220 911.4 67.82 4.0 0.6 25.0 0.7 34.2
230 911.4 66.74 172.0 1.3 192.0 1.9 37.8
240 911.4 66.56 39.0 0.2 145.0 0.3 41.6
250 911.4 65.66 56.0 1.9 67.0 2.2 51.8
260 911.5 65.66 74.0 0.8 101.0 1.2 41.1
270 911.4 66.92 147.0 0.9 174.0 1.1 36.0
280 911.3 64.76 73.0 1.0 82.0 1.2 43.3
290 911.3 64.94 164.0 1.3 176.0 1.7 43.0
... ... ... ... ... ... ... ...
1586960 914.7 76.46 247.0 0.6 264.0 0.7 43.4
1586970 914.8 76.28 208.0 0.7 216.0 0.9 43.7
1586980 914.8 76.10 209.0 0.7 216.0 0.9 43.9
1586990 914.9 76.28 339.0 0.5 350.0 0.7 43.4
1587000 914.9 75.92 344.0 0.4 352.0 0.6 43.9
1587010 915.0 75.56 323.0 0.3 348.0 0.5 45.5
1587020 915.1 75.56 324.0 1.1 347.0 1.5 46.0
1587030 915.1 75.74 1.0 1.3 13.0 1.7 45.8
1587040 915.2 75.38 355.0 0.9 1.0 1.1 46.1
1587050 915.3 75.38 359.0 1.4 11.0 1.5 45.8
1587060 915.4 75.38 11.0 1.1 21.0 1.3 45.7
1587070 915.5 75.38 13.0 1.4 24.0 1.6 46.6
1587080 915.6 75.20 18.0 1.0 24.0 1.2 46.5
1587090 915.6 75.20 356.0 1.7 1.0 1.9 47.2
1587100 915.7 75.38 13.0 1.5 24.0 1.7 46.7
1587110 915.7 75.02 19.0 1.2 28.0 1.4 46.7
1587120 915.7 74.84 25.0 1.4 35.0 1.6 46.5
1587130 915.8 74.84 23.0 1.3 30.0 1.5 46.9
1587140 915.8 74.84 32.0 1.4 41.0 1.7 45.5
1587150 915.8 75.20 23.0 1.1 31.0 1.4 45.7
1587160 915.8 75.38 16.0 1.2 28.0 1.5 46.3
1587170 915.7 75.38 347.0 1.2 353.0 1.4 48.1
1587180 915.8 75.74 326.0 1.2 337.0 1.6 48.3
1587190 915.9 75.92 289.0 0.7 309.0 0.9 48.1
1587200 915.9 75.74 335.0 0.9 348.0 1.1 47.8
1587210 915.9 75.56 330.0 1.0 341.0 1.3 47.8
1587220 915.9 75.56 330.0 1.1 341.0 1.4 48.0
1587230 915.9 75.56 344.0 1.4 352.0 1.7 48.0
1587240 915.9 75.20 359.0 1.3 9.0 1.6 46.3
1587250 915.9 74.84 6.0 1.5 20.0 1.9 46.1

158680 rows × 7 columns


Scale the Features using StandardScaler


In [17]:
X = StandardScaler().fit_transform(select_df)
X


Out[17]:
array([[-1.48456281,  0.24544455, -0.68385323, ..., -0.62153592,
        -0.74440309,  0.49233835],
       [-1.48456281,  0.03247142, -0.19055941, ...,  0.03826701,
        -0.66171726, -0.34710804],
       [-1.51733167,  0.12374562, -0.65236639, ..., -0.44847286,
        -0.37231683,  0.40839371],
       ..., 
       [-0.30488381,  1.15818654,  1.90856325, ...,  2.0393087 ,
        -0.70306017,  0.01538018],
       [-0.30488381,  1.12776181,  2.06599745, ..., -1.67073075,
        -0.74440309, -0.04948614],
       [-0.30488381,  1.09733708, -1.63895404, ..., -1.55174989,
        -0.62037434, -0.05711747]])
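Standardization matters for k-means because the algorithm uses Euclidean distance: without it, a feature with a large numeric range (such as air_pressure) would dominate the clustering. A quick check on synthetic data that StandardScaler leaves each column with zero mean and unit standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix (5 samples, 3 features) standing in for select_df.
rng = np.random.RandomState(0)
M = rng.normal(loc=10.0, scale=5.0, size=(5, 3))

Xs = StandardScaler().fit_transform(M)

# Each column now has (approximately) zero mean and unit standard
# deviation, so no single feature dominates the distance metric.
print(Xs.mean(axis=0), Xs.std(axis=0))
```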


Use k-Means Clustering


In [18]:
kmeans = KMeans(n_clusters=12)
model = kmeans.fit(X)
print("model\n", model)


model
 KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=12, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
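The choice of 12 clusters here is fixed up front. One common heuristic for choosing n_clusters is the elbow method: fit k-means for a range of k and watch where the inertia (within-cluster sum of squares, exposed as inertia_) stops dropping sharply. A minimal sketch on synthetic blobs (not the weather data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with three obvious groups (a stand-in for X).
rng = np.random.RandomState(42)
blobs = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                   for c in [(0, 0), (5, 5), (0, 5)]])

# Inertia drops as k grows; the "elbow" where it flattens out is a
# common heuristic for picking n_clusters.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(blobs).inertia_
    for k in range(1, 7)
}
for k in sorted(inertias):
    print(k, round(inertias[k], 1))
```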


What are the centers of the 12 clusters we formed?


In [22]:
centers = model.cluster_centers_
centers


Out[22]:
array([[ 0.23414265,  0.32088768,  1.88794018, -0.65174648, -1.55179078,
        -0.57663111, -0.28415129],
       [-0.21101616,  0.63388782,  0.40861282,  0.73377418,  0.51681119,
         0.67191387, -0.15134073],
       [-0.6969227 ,  0.54216256,  0.17702903, -0.5840737 ,  0.34628434,
        -0.59745567, -0.11354633],
       [-1.18235195, -0.86948308,  0.4468512 ,  1.98489163,  0.53827387,
         1.94597277,  0.90759772],
       [ 0.73141523,  0.43294657,  0.28515211, -0.5344004 ,  0.47287075,
        -0.5407336 , -0.76947082],
       [-0.16068847,  0.86265214, -1.31098811, -0.58986313, -1.16663766,
        -0.60518798, -0.64293243],
       [ 1.36987489, -0.08376038, -1.20690989, -0.0454475 , -1.07590457,
        -0.02492413, -0.97762873],
       [ 0.23733228, -0.99817197,  0.65636998, -0.54708994,  0.84558209,
        -0.52972043,  1.16473384],
       [ 0.06019157, -0.78770058, -1.19735701, -0.5706887 , -1.04352902,
        -0.58526678,  0.87793487],
       [ 1.18984935, -0.25485028, -1.15497786,  2.12621668, -1.05348987,
         2.24320671, -1.13475959],
       [ 0.13216168,  0.84256194,  1.41031142, -0.63874493,  1.67440929,
        -0.58952661, -0.71342998],
       [-0.8393307 , -1.200436  ,  0.37569195,  0.37534678,  0.47419514,
         0.36282493,  1.36099638]])


Plots

Let us first create some utility functions that will help us plot the graphs:


In [23]:
# Function that creates a DataFrame with a column for Cluster Number

def pd_centers(featuresUsed, centers):
    colNames = list(featuresUsed)
    colNames.append('prediction')

    # Append the cluster index to each center as a 'prediction' column
    Z = [np.append(A, index) for index, A in enumerate(centers)]

    # Convert to a pandas DataFrame for plotting
    P = pd.DataFrame(Z, columns=colNames)
    P['prediction'] = P['prediction'].astype(int)
    return P

In [47]:
# Function that creates Parallel Plots

def parallel_plot(data):
    my_colors = list(islice(cycle(['b', 'r', 'g', 'y', 'k']), None, len(data)))
    #print(my_colors)
    plt.figure(figsize=(15,8)).gca().axes.set_ylim([-3,+3])
    parallel_coordinates(data, 'prediction', color = my_colors, marker='o')

In [48]:
P = pd_centers(features, centers)
P


Out[48]:
air_pressure air_temp avg_wind_direction avg_wind_speed max_wind_direction max_wind_speed relative_humidity prediction
0 0.234143 0.320888 1.887940 -0.651746 -1.551791 -0.576631 -0.284151 0
1 -0.211016 0.633888 0.408613 0.733774 0.516811 0.671914 -0.151341 1
2 -0.696923 0.542163 0.177029 -0.584074 0.346284 -0.597456 -0.113546 2
3 -1.182352 -0.869483 0.446851 1.984892 0.538274 1.945973 0.907598 3
4 0.731415 0.432947 0.285152 -0.534400 0.472871 -0.540734 -0.769471 4
5 -0.160688 0.862652 -1.310988 -0.589863 -1.166638 -0.605188 -0.642932 5
6 1.369875 -0.083760 -1.206910 -0.045448 -1.075905 -0.024924 -0.977629 6
7 0.237332 -0.998172 0.656370 -0.547090 0.845582 -0.529720 1.164734 7
8 0.060192 -0.787701 -1.197357 -0.570689 -1.043529 -0.585267 0.877935 8
9 1.189849 -0.254850 -1.154978 2.126217 -1.053490 2.243207 -1.134760 9
10 0.132162 0.842562 1.410311 -0.638745 1.674409 -0.589527 -0.713430 10
11 -0.839331 -1.200436 0.375692 0.375347 0.474195 0.362825 1.360996 11

Dry Days


In [49]:
parallel_plot(P[P['relative_humidity'] < -0.5])


Warm Days


In [50]:
parallel_plot(P[P['air_temp'] > 0.5])


Cool Days


In [51]:
parallel_plot(P[(P['relative_humidity'] > 0.5) & (P['air_temp'] < 0.5)])
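Beyond the center profiles, it can also help to check how many samples landed in each cluster; on the fitted model above these assignments live in model.labels_. A minimal sketch of the idea on synthetic data (not the weather data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic groups of 40 points each.
rng = np.random.RandomState(0)
pts = np.vstack([rng.normal(loc=c, scale=0.2, size=(40, 2))
                 for c in [(0, 0), (4, 4)]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)

# labels_ assigns each sample to a cluster; bincount gives cluster sizes.
sizes = np.bincount(km.labels_)
print(sizes)
```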